Single-Layer Networks: Regression
Bio-statistical Learning
Introduction
- This chapter explores basic neural network concepts via linear regression.
- Linear regression models represent a simple, single-layer neural network.
- Why start here?
- Though of limited practical use on their own, linear models possess simple analytical properties.
- They are excellent for introducing core concepts fundamental to deep neural networks.
- Goal for today: Understand the building blocks before moving to complex architectures.
1. Linear Regression: The Goal
- Regression Task: Predict one or more continuous target variables \(t\) given a \(D\)-dimensional input vector \(\mathbf{x}\).
- Training Data: We are given \(N\) observations \(\{\mathbf{x}_n\}\) and corresponding target values \(\{t_n\}\).
- Model: We formulate a function \(y(\mathbf{x}, \mathbf{w})\) that makes predictions.
- \(\mathbf{w}\) represents a vector of learnable parameters.
- Simplest Model: A linear combination of input variables: \[y(\mathbf{x},\mathbf{w}) = w_0 + w_1x_1 + \dots + w_Dx_D \quad (1)\]
- \(\mathbf{x}=(x_{1},...,x_{D})^{T}\). \(w_0\) is the bias, \(w_1, \dots, w_D\) are weights.
- Key Property: Linear function of parameters \(w_j\).
- Limitation: Also a linear function of input variables \(x_i\), restricting its ability to model complex relationships.
1.1 Basis Functions: Adding Non-linearity
- To address the limits of simple linear regression, introduce basis functions.
- The model becomes a linear combination of (generally nonlinear) basis functions of the input variables: \[y(\mathbf{x},\mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x})\]
- \(\phi_j(\mathbf{x})\): basis functions.
- \(M\) parameters (\(M-1\) basis + bias \(w_0\)).
- Often use a dummy basis \(\phi_0(\mathbf{x}) = 1\): \[y(\mathbf{x},\mathbf{w}) = \sum_{j=0}^{M-1} w_j\phi_j(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})\]
- \(\mathbf{w} = (w_0, \dots, w_{M-1})^T\)
- \(\phi(\mathbf{x}) = (\phi_0(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x}))^T\)
- Linear in \(\mathbf{w}\): \(y(\mathbf{x},\mathbf{w})\) can be nonlinear in \(\mathbf{x}\), but model remains linear in the parameters.
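The model \(y(\mathbf{x},\mathbf{w}) = \mathbf{w}^T\phi(\mathbf{x})\) can be sketched in a few lines of NumPy. This is an illustrative sketch (the helper name `gaussian_basis` and the specific centers and width are choices made here, not from the text), using Gaussian basis functions with a dummy basis \(\phi_0(\mathbf{x})=1\) for the bias:

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Gaussian basis functions phi_j(x) for scalar inputs x.
    Returns an (N, M) matrix whose first column is phi_0 = 1 (the bias term)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)          # (N, 1)
    phi = np.exp(-(x - centers) ** 2 / (2.0 * s ** 2))     # (N, M-1)
    return np.hstack([np.ones((x.shape[0], 1)), phi])      # prepend phi_0 = 1

centers = np.linspace(0.0, 1.0, 5)     # the mu_j (illustrative placement)
Phi = gaussian_basis([0.2, 0.5, 0.9], centers, s=0.2)
w = np.zeros(Phi.shape[1])             # M = 6 parameters (bias + 5 basis weights)
y = Phi @ w                            # y(x, w) = w^T phi(x), one value per input
```

Note that `y` is linear in `w` even though each \(\phi_j\) is a nonlinear function of the input.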
Visualizing Linear Regression with Basis Functions
Network Diagram
Figure 1: Linear regression model as a single-layer network. Each \(\phi_j(\mathbf{x})\) is an input, \(w_j\) are weights, \(y(\mathbf{x},\mathbf{w})\) is the output. The solid blue node represents the bias \(\phi_0(\mathbf{x})=1\).
Role of Basis Functions
- Before deep learning, feature extraction (choosing good \(\phi_j(\mathbf{x})\)) was crucial.
- Deep learning aims to learn these transformations from data.
Examples of Basis Functions
| Type | Form (scalar \(x\)) | Notes |
|---|---|---|
| Polynomial | \(\phi_j(x) = x^j\) | Applied to scalar \(x\) or to components of \(\mathbf{x}\) |
| Gaussian | \(\phi_j(x) = \exp\left\{-\frac{(x-\mu_j)^2}{2s^2}\right\}\) | \(\mu_j\): location (center) of the basis function, \(s\): width. For vector \(\mathbf{x}\): \(\exp\left\{-\frac{\|\mathbf{x}-\boldsymbol{\mu}_j\|^2}{2s^2}\right\}\) |
| Sigmoidal | \(\phi_j(x) = \sigma\left(\frac{x-\mu_j}{s}\right)\), where \(\sigma(a) = \frac{1}{1+\exp(-a)}\) | For vector \(\mathbf{x}\), the argument is often a projection such as \(\mathbf{v}^T\mathbf{x} + v_0\) |
- While formulas are often shown for scalar \(x\) for simplicity (as in Figure 2), in a \(D\)-dimensional input space these functions operate on the full vector \(\mathbf{x}\) or on its individual components.
Visualizing Basis Functions (1D Example)
a. Polynomials (\(x^j\))
b. Gaussians (\(\exp(-(x-\mu_j)^2/2s^2)\))
c. Sigmoids (\(\sigma((x-\mu_j)/s)\))
Figure 2: Examples of basis functions plotted against a single variable \(x\).
- For now, our discussion is largely independent of the specific choice of basis functions \(\phi_j(\mathbf{x})\).
- We’ll focus on a single target variable \(t\) for simplicity.
1.2 Likelihood Function: Probabilistic View
- Assume target variable \(t\) is the model prediction \(y(\mathbf{x},\mathbf{w})\) plus additive Gaussian noise \(\epsilon\): \[t = y(\mathbf{x},\mathbf{w}) + \epsilon \quad (7)\]
- \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) (zero-mean Gaussian noise with variance \(\sigma^2\)).
- This implies a conditional probability distribution for \(t\): \[p(t|\mathbf{x},\mathbf{w},\sigma^2) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}), \sigma^2) \quad (8)\]
- Given a dataset \(X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}\) and targets \(\mathbf{t} = \{t_1, \dots, t_N\}\), assuming data points are drawn independently: Likelihood Function: \[p(\mathbf{t}|X,\mathbf{w},\sigma^2) = \prod_{n=1}^{N} \mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \sigma^2) \quad (9)\]
Log-Likelihood and Error Function
It’s often easier to work with the log-likelihood: \[\ln p(\mathbf{t}|X,\mathbf{w},\sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \sigma^2)\]
Using the form of a univariate Gaussian: \[\ln p(\mathbf{t}|X,\mathbf{w},\sigma^2) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 \quad (10)\]
Let’s define the sum-of-squares error function: \[E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 \quad (11)\]
Substituting (11) into (10): \[\ln p(\mathbf{t}|X,\mathbf{w},\sigma^2) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{\sigma^2}E_D(\mathbf{w}) \quad (12)\]
Key Insight: Maximizing the log-likelihood (w.r.t. \(\mathbf{w}\)) under a Gaussian noise assumption is equivalent to minimizing the sum-of-squares error \(E_D(\mathbf{w})\).
1.3 Maximum Likelihood Solution for \(\mathbf{w}\)
- To find \(\mathbf{w}_{ML}\), we maximize \(\ln p(\mathbf{t}|X,\mathbf{w},\sigma^2)\) w.r.t. \(\mathbf{w}\).
- This is equivalent to minimizing \(E_D(\mathbf{w})\). The gradient of \(E_D(\mathbf{w})\) w.r.t. \(\mathbf{w}\) is: \[\nabla_{\mathbf{w}} E_D(\mathbf{w}) = -\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}\phi(\mathbf{x}_n) \quad (13)\]
- Setting this gradient to zero gives: \[0 = \sum_{n=1}^{N}t_n\phi(\mathbf{x}_n)^T - \mathbf{w}^T\left(\sum_{n=1}^{N}\phi(\mathbf{x}_n)\phi(\mathbf{x}_n)^T\right) \quad (14)\]
- Solving for \(\mathbf{w}_{ML}\): \[\mathbf{w}_{ML} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t} \quad (15)\]
- These are the normal equations. \(\mathbf{t}\) is the vector of target values.
Design Matrix and Pseudo-Inverse
- In \(\mathbf{w}_{ML} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}\):
- \(\mathbf{t}\) is the column vector \((t_1, \dots, t_N)^T\).
- \(\Phi\) is the \(N \times M\) design matrix: \[\Phi = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \dots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \dots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \dots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix} \quad (16)\] (Each \(\phi_j(\mathbf{x}_n)\) is a scalar output of the \(j\)-th basis function for the \(n\)-th input vector)
- The term \(\Phi^\dagger \equiv (\Phi^T\Phi)^{-1}\Phi^T\) is the Moore-Penrose pseudo-inverse of \(\Phi\). (17)
- So, \(\mathbf{w}_{ML} = \Phi^\dagger \mathbf{t}\).
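The normal equations can be checked numerically. The sketch below (a minimal example; the synthetic target \(\sin(2\pi x)\) plus noise and the polynomial degree are choices made here) fits a polynomial basis model and also computes the maximum-likelihood noise variance \(\sigma_{ML}^2\) of equation (21):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D data: t = sin(2*pi*x) + Gaussian noise (an illustrative choice)
N = 50
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, N)

# Polynomial design matrix Phi with phi_j(x) = x^j, j = 0..M-1, as in eq. (16)
M = 4
Phi = np.vander(x, M, increasing=True)        # (N, M)

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed stably via least squares
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Equivalent route via the Moore-Penrose pseudo-inverse, eq. (17)
w_pinv = np.linalg.pinv(Phi) @ t
assert np.allclose(w_ml, w_pinv)

# Maximum-likelihood noise variance, eq. (21)
sigma2_ml = np.mean((t - Phi @ w_ml) ** 2)
```

In practice `np.linalg.lstsq` (or `pinv`) is preferred over forming \((\Phi^T\Phi)^{-1}\) explicitly, since it handles ill-conditioned design matrices more gracefully.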
Role of the Bias Parameter \(w_0\)
Error function with explicit \(w_0\) (where \(\mathbf{w}\) now excludes \(w_0\)): \[E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - w_0 - \sum_{j=1}^{M-1}w_j\phi_j(\mathbf{x}_n)\right\}^2 \quad (18)\]
Setting \(\frac{\partial E_D}{\partial w_0} = 0\) and solving for \(w_0\): \[w_0 = \bar{t} - \sum_{j=1}^{M-1}w_j\bar{\phi}_j \quad (19)\] where \(\bar{t} = \frac{1}{N}\sum_{n=1}^{N}t_n\) and \(\bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(\mathbf{x}_n)\). (20)
Interpretation: the bias \(w_0\) compensates for the difference between the mean of the targets \(\bar{t}\) and the weighted sum of the mean basis function values \(\bar{\phi}_j\).
Maximum Likelihood Solution for \(\sigma^2\)
- Maximize log-likelihood (12) w.r.t. \(\sigma^2\): \[\sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^{N}\{t_n - \mathbf{w}_{ML}^T\phi(\mathbf{x}_n)\}^2 \quad (21)\]
- Interpretation: \(\sigma_{ML}^2\) is the mean squared residual of the target values around the fitted regression function.
1.4 Geometry of Least Squares
Geometric View
- N-dim space: axes \(t_n\). \(\mathbf{t} = (t_1, \dots, t_N)^T\).
- Basis vectors \(\boldsymbol{\varphi}_j = (\phi_j(\mathbf{x}_1), \dots, \phi_j(\mathbf{x}_N))^T\). (Each \(\boldsymbol{\varphi}_j\) is a column in \(\Phi\))
- These span an \(M\)-dim subspace \(S\) (if \(M<N\)).
- Prediction \(\mathbf{y}\) (elements \(y(\mathbf{x}_n, \mathbf{w})\)) lies in \(S\).
- \(E_D(\mathbf{w}) = \frac{1}{2}||\mathbf{y} - \mathbf{t}||^2\).
- Minimizing \(E_D(\mathbf{w})\) finds \(\mathbf{y} \in S\) closest to \(\mathbf{t}\).
- Solution: \(\mathbf{y}\) is the orthogonal projection of \(\mathbf{t}\) onto \(S\).
Diagram
Figure 3: The vector \(\mathbf{t}\) is projected onto the subspace \(S\) spanned by basis function vectors \(\boldsymbol{\varphi}_j\). The projection is \(\mathbf{y}\).
Numerical Issues: If \(\Phi^T\Phi\) is singular or ill-conditioned, consider using SVD or adding regularization to improve numerical stability.
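The orthogonal-projection picture can be verified directly: at the least-squares solution, the residual \(\mathbf{t} - \mathbf{y}\) is orthogonal to every column \(\boldsymbol{\varphi}_j\) of \(\Phi\). A minimal check on random data (the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 3
Phi = rng.normal(size=(N, M))        # columns are the basis vectors varphi_j
t = rng.normal(size=N)

w = np.linalg.lstsq(Phi, t, rcond=None)[0]
y = Phi @ w                          # orthogonal projection of t onto S

# The residual t - y is orthogonal to every varphi_j: Phi^T (t - y) = 0
assert np.allclose(Phi.T @ (t - y), 0.0, atol=1e-10)
```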
1.5 Sequential Learning (Online Algorithms)
- \(\mathbf{w}_{ML} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}\) is a batch method.
- Sequential (online) algorithms: Process data points one at a time.
- Good for large datasets / real-time.
- Stochastic Gradient Descent (SGD): If error \(E = \sum_n E_n\), update after point \(n\): \[\boxed{\color{blue}{\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n \quad (22)}}\]
- \(\mathbf{w}^{(\tau)}\): params at iteration \(\tau\).
- \(\eta\): learning rate.
- \(\nabla E_n\): gradient for \(n\)-th data point.
LMS Algorithm (Least Mean Squares)
- For sum-of-squares error, \(E_n = \frac{1}{2}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2\).
- \(\nabla E_n = -(t_n - \mathbf{w}^T\phi(\mathbf{x}_n))\phi(\mathbf{x}_n)\). (Here \(\phi(\mathbf{x}_n)\) is the vector of basis function outputs for \(\mathbf{x}_n\))
- SGD update rule (22) becomes: \[\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta(t_n - \mathbf{w}^{(\tau)T}\phi(\mathbf{x}_n))\phi(\mathbf{x}_n) \quad (23)\]
- Known as Least-Mean-Squares (LMS) or Widrow-Hoff rule.
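The LMS update (23) is simple enough to sketch end to end. Below is an illustrative simulation (the helper name `lms_step`, the true weight vector, and the learning rate are choices made here, and the data stream is noise-free for clarity), showing the weights converging toward the generating parameters:

```python
import numpy as np

def lms_step(w, phi_n, t_n, eta):
    """One LMS (Widrow-Hoff) update, eq. (23):
    w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    error = t_n - w @ phi_n
    return w + eta * error * phi_n

rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0, 0.5])   # parameters generating the data
w = np.zeros(3)                       # initial guess
eta = 0.05                            # learning rate

# Process data points one at a time, as a sequential algorithm would
for _ in range(2000):
    phi_n = rng.normal(size=3)        # basis vector for the n-th point
    t_n = w_true @ phi_n              # noise-free target for illustration
    w = lms_step(w, phi_n, t_n, eta)
```

With noisy targets, `w` would instead fluctuate around `w_true`, with the spread controlled by the learning rate \(\eta\).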
1.6 Regularized Least Squares
- Regularization adds a penalty to control overfitting.
- Total error function: \[E_{total}(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) \quad (24)\]
- \(E_D(\mathbf{w})\): data error.
- \(E_W(\mathbf{w})\): regularization term.
- \(\lambda\): regularization coefficient.
- Common regularizer: L2 regularization (ridge regression): \[E_W(\mathbf{w}) = \frac{1}{2}\sum_{j=0}^{M-1} w_j^2 = \frac{1}{2}\mathbf{w}^T \mathbf{w} \quad (25)\]
Solution for Regularized Least Squares
- Total error with L2 regularization: \[\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T \mathbf{w} \quad (26)\]
- Set gradient w.r.t. \(\mathbf{w}\) to zero: \[(\Phi^T\Phi + \lambda I)\mathbf{w} = \Phi^T\mathbf{t}\]
- Solution: \[\mathbf{w} = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T\mathbf{t} \quad (27)\]
- \(I\) is identity matrix. \(\lambda I\) helps with singularity.
- Shrinks weights towards zero.
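The closed-form solution (27) is a one-liner in NumPy. A minimal sketch on random data (the helper name `ridge_fit` and the dimensions are choices made here), also demonstrating the shrinkage effect:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares, eq. (27): w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(3)
Phi = rng.normal(size=(30, 5))
t = rng.normal(size=30)

w_unreg = ridge_fit(Phi, t, lam=0.0)   # ordinary least squares
w_reg = ridge_fit(Phi, t, lam=10.0)    # ridge regression

# Larger lambda shrinks the weight vector toward zero
assert np.linalg.norm(w_reg) < np.linalg.norm(w_unreg)
```

Note that `np.linalg.solve` succeeds here even when \(\Phi^T\Phi\) alone is singular, since \(\lambda I\) makes the system positive definite for \(\lambda > 0\).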
1.7 Multiple Outputs
- To predict \(K > 1\) target variables \(\mathbf{t} = (t_1, \dots, t_K)^T\).
- Use same basis functions \(\phi(\mathbf{x})\) for all \(K\) outputs: \[\mathbf{y}(\mathbf{x},W) = W^T\phi(\mathbf{x}) \quad (28)\]
- \(\mathbf{y}(\mathbf{x},W)\): \(K\)-dim vector.
- \(W\): \(M \times K\) parameter matrix. (Each column is a weight vector \(\mathbf{w}_k\))
- \(\phi(\mathbf{x})\): \(M\)-dim basis vector.
Network for Multiple Outputs
Figure 4: Linear regression for multiple outputs \(y_1, \dots, y_K\). Each output \(y_k\) is a linear combination of the basis functions \(\phi_j(\mathbf{x})\) with its own set of weights (a column in \(W\)).
Likelihood for Multiple Outputs
- Assume isotropic Gaussian conditional distribution: \[p(\mathbf{t}|\mathbf{x},W,\sigma^2) = \mathcal{N}(\mathbf{t}|W^T\phi(\mathbf{x}), \sigma^2I) \quad (29)\]
- Log-likelihood for \(N\) observations (targets \(T\) as \(N \times K\) matrix): \[\ln p(T|X,W,\sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{t}_n|W^T\phi(\mathbf{x}_n), \sigma^2I)\] \[= -\frac{NK}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}||\mathbf{t}_n - W^T\phi(\mathbf{x}_n)||^2 \quad (30)\]
ML Solution for Multiple Outputs
- Maximizing log-likelihood (30) w.r.t. \(W\): \[W_{ML} = (\Phi^T\Phi)^{-1}\Phi^T T \quad (31)\]
- For each column \(\mathbf{w}_k\) of \(W_{ML}\) (params for \(k\)-th output) and \(\mathbf{t}_k\) (vector of \(N\) targets for \(k\)-th output) of \(T\): \[\mathbf{w}_k = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}_k = \Phi^\dagger\mathbf{t}_k \quad (32)\]
- Key Result: Regression decouples for each target. Pseudo-inverse \(\Phi^\dagger\) is shared.
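The decoupling result (32) is easy to confirm numerically: solving all \(K\) outputs at once with one shared pseudo-inverse gives the same columns as \(K\) separate single-output regressions. A sketch on random data (dimensions chosen arbitrarily here):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 40, 4, 3
Phi = rng.normal(size=(N, M))
T = rng.normal(size=(N, K))           # N x K matrix of targets

# W_ML = (Phi^T Phi)^{-1} Phi^T T, eq. (31) -- one shared pseudo-inverse
W_ml = np.linalg.pinv(Phi) @ T        # (M, K)

# Decoupling, eq. (32): column k equals a separate regression on T[:, k]
for k in range(K):
    w_k = np.linalg.pinv(Phi) @ T[:, k]
    assert np.allclose(W_ml[:, k], w_k)
```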
2. Decision Theory: Making Predictions
- We’ve learned to model \(p(t|\mathbf{x})\), e.g., \(\mathcal{N}(t|y(\mathbf{x},\mathbf{w}_{ML}), \sigma_{ML}^2)\). This is our predictive distribution.
- But often, we need to make a single, concrete prediction \(f(\mathbf{x})\).
- Analogy: A weather model gives a 70% chance of rain (\(p(t|\mathbf{x})\)). You need to decide: “Take an umbrella” or “Leave it” (\(f(\mathbf{x})\)).
- Two Stages:
- Inference Stage: Learn \(p(t|\mathbf{x})\) from training data. (We’ve done this!)
- Decision Stage: Choose an optimal prediction \(f(\mathbf{x})\) using \(p(t|\mathbf{x})\) and a loss function \(L(t, f(\mathbf{x}))\).
- A loss function measures the “cost” or “error” if the true value is \(t\) and we predict \(f(\mathbf{x})\).
What’s a “Good” Prediction? Expected Loss
- How do we measure how “wrong” our prediction \(f(\mathbf{x})\) is compared to the true value \(t\)?
- A very common way for regression is the squared loss: \[L(t, f(\mathbf{x})) = \{f(\mathbf{x})-t\}^2\]
- Why squared? It punishes larger errors much more than small errors. It’s also mathematically convenient!
- We want to choose \(f(\mathbf{x})\) that minimizes the average or expected loss over all possible values of \(\mathbf{x}\) and \(t\): \[\mathbb{E}[L] = \iint \{f(\mathbf{x})-t\}^2 p(\mathbf{x},t)d\mathbf{x} dt \quad (35)\]
- \(p(\mathbf{x},t)\) is the true joint probability of \(\mathbf{x}\) and \(t\).
- Our goal: Find \(f(\mathbf{x})\) that makes \(\mathbb{E}[L]\) as small as possible.
Optimal Prediction for Squared Loss
- If we use the squared loss \(L(t,f(\mathbf{x})) = \{f(\mathbf{x})-t\}^2\):
- The prediction \(f^*(\mathbf{x})\) that minimizes the expected squared loss \(\mathbb{E}[L]\) is the conditional mean of \(t\) given \(\mathbf{x}\): \[\boxed{f^*(\mathbf{x}) = \mathbb{E}_t[t|\mathbf{x}] = \int t\, p(t|\mathbf{x})\, dt} \quad (37)\]
- This means: for any given input \(\mathbf{x}\), the “best” prediction is the average of all possible true target values \(t\) that could occur for that \(\mathbf{x}\).
- This \(f^*(\mathbf{x})\) is often called the regression function.
- For our Gaussian model: \(p(t|\mathbf{x}) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}), \sigma^2)\).
- The conditional mean is simply: \(\mathbb{E}_t[t|\mathbf{x}] = y(\mathbf{x},\mathbf{w}) \quad (38)\)
- So, the optimal prediction is our model’s output \(y(\mathbf{x},\mathbf{w}_{ML})\)!
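The claim that the conditional mean minimizes expected squared loss can be illustrated with a Monte Carlo sketch (the distribution parameters and the candidate grid are arbitrary choices made here): among all constant predictions \(f\) for a fixed \(\mathbf{x}\), the one with the lowest average squared loss sits at the mean of \(t\).

```python
import numpy as np

rng = np.random.default_rng(5)
# Samples of t for one fixed input x, drawn from p(t | x) = N(2, 1)
t = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Evaluate the average squared loss for a grid of candidate predictions f
candidates = np.linspace(0.0, 4.0, 401)
avg_loss = [np.mean((f - t) ** 2) for f in candidates]
best_f = candidates[np.argmin(avg_loss)]

# The minimizer sits at the sample mean of t (approximating E[t | x])
assert abs(best_f - t.mean()) < 0.02
```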
Figure 5: The optimal prediction \(f^*(x)\) (red curve) is the mean of the conditional distribution \(p(t|x_0)\) (blue curve) for each input \(x_0\). (Shown for scalar \(x\) for visualization)
Decomposing the Expected Squared Loss
- Let’s look closer at the expected squared loss (eq. 35). If we make the prediction \(f(\mathbf{x})\), the expected loss can be broken down: \[\mathbb{E}[L] = \underbrace{\int \{f(\mathbf{x})-\mathbb{E}[t|\mathbf{x}]\}^2 p(\mathbf{x})d\mathbf{x}}_{\text{Term 1: Our Model's Contribution to Error}} + \underbrace{\int \text{var}[t|\mathbf{x}] p(\mathbf{x})d\mathbf{x}}_{\text{Term 2: Irreducible Error (Noise)}} \quad (\text{derived from } 39)\]
- Term 1: Our Model’s Contribution
- This part depends on how different our prediction \(f(\mathbf{x})\) is from the ideal prediction \(\mathbb{E}[t|\mathbf{x}]\) (the true conditional mean).
- If we choose \(f(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}]\) (the optimal prediction), this term becomes zero!
- Term 2: Irreducible Error (Noise)
- \(\text{var}[t|\mathbf{x}]\) is the variance of \(t\) given \(\mathbf{x}\). It’s the inherent randomness or “noise” in the data that our model cannot predict.
- This term represents the minimum achievable expected loss, even with a perfect model that knows \(\mathbb{E}[t|\mathbf{x}]\).
- This is the fundamental limit due to the nature of the data itself.
3. The Bias-Variance Trade-off: The Challenge
- Goal: We want models that predict well on new, unseen data (generalization).
- But models can be tricky!
- Too simple? Might miss the true pattern (underfitting).
- Too complex? Might fit the training noise, not the pattern (overfitting).
- The Bias-Variance Trade-off helps us understand this.
- Analogy: The Archer
- High Bias Archer: Consistently misses the bullseye in the same direction (e.g., always high and left). Their average shot is off. (Model is too simple, systematically wrong).
- High Variance Archer: Shots are scattered all around the target. No consistent miss, but widely spread. (Model is too sensitive, predictions change wildly with different data).
- Good Archer: Hits near the bullseye, consistently (Low Bias, Low Variance). This is our goal!
Understanding Prediction Errors: Bias and Variance
- Let \(h(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}]\) be the true, optimal regression function (the bullseye).
- Our model \(f(\mathbf{x}; \mathcal{D})\) is trained on a specific dataset \(\mathcal{D}\).
- If we could average over many possible datasets \(\mathcal{D}\), the expected squared difference between our model’s prediction \(f(\mathbf{x}; \mathcal{D})\) and the true function \(h(\mathbf{x})\) for a given \(\mathbf{x}\) can be broken down:
\[\mathbb{E}_{\mathcal{D}}[\{f(\mathbf{x}; \mathcal{D}) - h(\mathbf{x})\}^2] = \underbrace{\{\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2}_{\text{(bias)}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}[\{f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]\}^2]}_{\text{variance}} \quad (44)\]
- Bias (Squared): \(\{\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2\)
- \(\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]\) is the average prediction our model type would make for input \(\mathbf{x}\), if trained on many different datasets.
- Bias measures how far this average model prediction is from the true function \(h(\mathbf{x})\).
- High bias: Model is fundamentally “off target”, too simple, or makes systematic errors.
- Variance: \(\mathbb{E}_{\mathcal{D}}[\{f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]\}^2]\)
- Measures how much our model’s predictions \(f(\mathbf{x}; \mathcal{D})\) scatter or change if we train it on different specific datasets \(\mathcal{D}\).
- High variance: Model is too sensitive to the training data; it overfits the noise and doesn’t generalize well.
Overall Expected Loss Decomposition
Integrating over all \(\mathbf{x}\), the total expected loss of our model is: \[\text{Expected Loss} = (\text{Bias})^2 + \text{Variance} + \text{Noise} \quad (\text{based on } 45)\]
Where:
- \((\text{Bias})^2 = \int \{\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2 p(\mathbf{x})d\mathbf{x} \quad (46)\)
- Systematic error from model being too simple.
- \(\text{Variance} = \int \mathbb{E}_{\mathcal{D}}[\{f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]\}^2] p(\mathbf{x})d\mathbf{x} \quad (47)\)
- Error from model’s sensitivity to specific training data (overfitting).
- \(\text{Noise} = \int \text{var}[t|\mathbf{x}] p(\mathbf{x})d\mathbf{x} \quad (\text{var}[t|\mathbf{x}] = \mathbb{E}_t[\{t-h(\mathbf{x})\}^2|\mathbf{x}])\)
- Irreducible error due to inherent data variability.
Our Challenge: Minimize \((\text{Bias})^2 + \text{Variance}\). There's often a trade-off:
- Very simple models: Low Variance, High Bias.
- Very complex models: High Variance, Low Bias.
Quantitative Bias-Variance Trade-off
Figure 8: Plot of squared bias, variance, their sum, and test error versus \(\ln \lambda\) for a 1D example.
- \(\ln \lambda\) on x-axis: Controls model complexity.
- Large \(\ln \lambda\) (right): Strong regularization, simpler model.
- Small \(\ln \lambda\) (left): Weak regularization, more complex model.
- Observe the Trade-off:
- Simple models (right): High Bias, Low Variance.
- Complex models (left): Low Bias, High Variance.
- The Total Error (Bias² + Variance) has a minimum. This is the “sweet spot” for \(\lambda\) we want to find!
- Regularization (choosing \(\lambda\)) is key to navigating this trade-off.
- Approximate estimates over \(N_{test}\) test points \(\mathbf{x}_n\), using \(L\) models \(f^{(l)}\) trained on different datasets: \[\bar{f}(\mathbf{x}_n) = \frac{1}{L}\sum_{l=1}^{L} f^{(l)}(\mathbf{x}_n)\] \[(\text{bias})^2 \approx \frac{1}{N_{test}}\sum_{n=1}^{N_{test}} \{\bar{f}(\mathbf{x}_n) - h(\mathbf{x}_n)\}^2\] \[\text{variance} \approx \frac{1}{N_{test}}\sum_{n=1}^{N_{test}} \frac{1}{L}\sum_{l=1}^{L} \{f^{(l)}(\mathbf{x}_n) - \bar{f}(\mathbf{x}_n)\}^2\]
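These approximate calculations can be run end to end in a short simulation. The sketch below is illustrative (the true function \(\sin(2\pi x)\), the polynomial basis, and the specific \(L\), \(N\), \(M\), \(\lambda\) values are choices made here, not from the text): train \(L\) ridge models on independent datasets, then estimate \((\text{bias})^2\) and variance on a test grid.

```python
import numpy as np

rng = np.random.default_rng(6)

def h(x):
    """The 'true' regression function (an illustrative choice)."""
    return np.sin(2 * np.pi * x)

def fit_ridge(x, t, M, lam):
    """Ridge fit with a polynomial basis, as in eq. (27)."""
    Phi = np.vander(x, M, increasing=True)
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

L_runs, N, M, lam = 100, 25, 8, 1e-3
x_test = np.linspace(0.0, 1.0, 50)
Phi_test = np.vander(x_test, M, increasing=True)

preds = np.empty((L_runs, x_test.size))
for l in range(L_runs):              # train on L independent datasets D
    x = rng.uniform(0.0, 1.0, N)
    t = h(x) + rng.normal(0.0, 0.2, N)
    preds[l] = Phi_test @ fit_ridge(x, t, M, lam)

f_bar = preds.mean(axis=0)                       # average prediction f-bar
bias2 = np.mean((f_bar - h(x_test)) ** 2)        # (bias)^2 estimate
variance = np.mean(preds.var(axis=0))            # variance estimate
```

Sweeping `lam` over several orders of magnitude and plotting `bias2` and `variance` reproduces the qualitative picture in Figure 8.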
Limitations of Bias-Variance View
- Practical Value: Hard to calculate exact bias and variance for real problems (needs many datasets or strong assumptions).
- Conceptual Value: Extremely useful for intuition about model complexity, underfitting, overfitting, and the role of regularization. It’s a powerful mental model.
- Frequentist Concept: The Bias-Variance decomposition is primarily a frequentist idea. Bayesian approaches handle model complexity and overfitting differently (e.g., through marginalization).
Summary of Single-Layer Regression Networks
- Linear regression as a simple neural network, predicting \(t\) from input vector \(\mathbf{x}\).
- Basis functions \(\phi_j(\mathbf{x})\) for nonlinear relationships (model \(y(\mathbf{x},\mathbf{w})\) is linear in parameters \(\mathbf{w}\)).
- Max likelihood (Gaussian noise) \(\iff\) min sum-of-squares error.
- Closed-form solution (normal equations for \(\mathbf{w}_{ML}\)).
- Sequential learning (LMS/SGD).
- Regularization (L2) controls overfitting by managing model complexity.
- Decision theory guides how to make optimal predictions (for squared loss, predict the conditional mean \(\mathbb{E}[t|\mathbf{x}]\)).
- Bias-variance trade-off explains the relationship between model complexity, bias (underfitting), variance (overfitting), and generalization. Regularization helps find a good balance.
Questions & Discussion
- Thank you!
- Adapted from the book “Deep Learning” by Bishop & Bishop